fix #173. When wandb detected pass to Trainer(...,devices=1,...) otherwise wandb creates multiple folders and crashes. #174

Conversation
Walkthrough

Modified GPU device management in the training module to conditionally restrict to a single GPU when using wandb. Addresses multi-GPU sweep crashes by setting devices=1 on the Trainer when wandb is detected.
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~8 minutes
Pre-merge checks and finishing touches: ✅ Passed checks (5 passed)
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
icu_benchmarks/models/train.py (1)
157-158: Pre-existing bug: Condition always evaluates to True.

This is not part of your changes, but there's a logic error here. The string "16-mixed" is always truthy, so this condition always executes regardless of the precision value. Should be:

```diff
- if precision == 16 or "16-mixed":
+ if precision == 16 or precision == "16-mixed":
```

Note: This bug may have minimal impact since setting matmul precision to "medium" is often desirable anyway, but the logic should still be corrected for clarity and correctness.
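To illustrate the truthiness issue as a standalone snippet (the values below are invented for the example, not taken from train.py):

```python
precision = 32

# Buggy form: the non-empty string literal "16-mixed" is always truthy,
# so the `or` expression is True regardless of `precision`.
if precision == 16 or "16-mixed":
    print("branch taken even though precision is 32")

# Corrected form: compare `precision` against both accepted values.
if precision == 16 or precision == "16-mixed":
    print("branch taken for 16 / '16-mixed'")
else:
    print("branch skipped for precision = 32")
```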
🧹 Nitpick comments (1)
icu_benchmarks/models/train.py (1)
145-148: Optional: Consider clarifying the wandb logging message.

The device restriction correctly prevents wandb from creating multiple folders on multi-GPU systems. The implementation is sound.

The log message is accurate but could be more specific:

```diff
- logging.info("Use of wandb is detected. Only single gpu training is supported with wandb.")
+ logging.info("Wandb detected. Restricting training to a single device to prevent sweep crashes.")
```

This makes the workaround nature explicit and uses "device" instead of "gpu" for accuracy across hardware types.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
icu_benchmarks/models/train.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
icu_benchmarks/models/train.py (1)
icu_benchmarks/imputation/simple_diffusion.py (1)
on_fit_start (117-126)
🔇 Additional comments (2)
icu_benchmarks/models/train.py (2)
143-144: Device initialization logic looks good.

The initialization correctly defaults to at least 1 device and uses the available CUDA device count when GPUs are present. This provides a sensible baseline before the wandb-specific override.
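A sketch of the initialization this comment describes, assuming the module queries GPUs through torch (the exact expression in train.py may differ):

```python
import torch

# At least one device by default; on GPU machines, use the visible CUDA device count.
devices = max(1, torch.cuda.device_count()) if torch.cuda.is_available() else 1
```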
166-166: Correct usage of the computed devices variable.

The dynamic device assignment properly integrates with the wandb-aware device management logic introduced earlier.
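And a hedged sketch of how the computed `devices` value would be handed to the Trainer, assuming PyTorch Lightning; every other argument below is a placeholder rather than the project's actual call:

```python
from pytorch_lightning import Trainer  # or `from lightning.pytorch import Trainer`, depending on the version

trainer = Trainer(
    accelerator="auto",  # placeholder
    devices=devices,     # the wandb-aware value computed above
    max_epochs=10,       # placeholder
)
```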
When wandb is detected, pass devices=1 to Trainer(...); otherwise wandb creates multiple folders and crashes. Fixes #173.